Foundational principles for large scale inference: Illustrations through correlation mining
When can reliable inference be drawn in the "Big Data" context? This paper
presents a framework for answering this fundamental question in the context of
correlation mining, with implications for general large-scale inference. In
large-scale data applications such as genomics, connectomics, and
eco-informatics, the dataset is often variable-rich but sample-starved: a
regime where the
number of acquired samples (statistical replicates) is far fewer than the
number of observed variables (genes, neurons, voxels, or chemical
constituents). Much recent work has focused on understanding the
computational complexity of methods proposed for "Big Data." Sample complexity,
however, has received comparatively little attention, especially in the setting
where the sample size is fixed and the dimension grows without bound. To
address this gap, we develop a unified statistical framework that explicitly
quantifies the sample complexity of various inferential tasks. Sampling regimes
can be divided into several categories: 1) the classical asymptotic regime
where the variable dimension is fixed and the sample size goes to infinity; 2)
the mixed asymptotic regime where both variable dimension and sample size go to
infinity at comparable rates; 3) the purely high dimensional asymptotic regime
where the variable dimension goes to infinity and the sample size is fixed.
Each regime has its niche, but only the last applies to exascale data
dimensions. We illustrate this high-dimensional framework for the problem of
correlation mining, where it is the matrix of pairwise and partial correlations
among the variables that is of interest. We demonstrate various regimes of
correlation mining from the unifying perspective of high-dimensional
learning rates and sample complexity for different structured covariance models
and different inference tasks.
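To make the sample-starved regime concrete, here is a minimal numerical sketch (ours, not the paper's) of why inference is delicate when the sample size is fixed and the dimension is large: with only n = 10 samples of p = 1000 mutually independent variables, some pairs exhibit large sample correlations purely by chance.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 10, 1000  # sample-starved regime: n is fixed and far smaller than p

# n samples of p mutually independent standard normal variables:
# the true (population) correlation of every pair is exactly zero.
X = rng.normal(size=(n, p))

R = np.corrcoef(X, rowvar=False)  # p x p matrix of sample correlations
np.fill_diagonal(R, 0.0)          # ignore the trivial diagonal
max_spurious = np.abs(R).max()
print(f"largest spurious sample correlation: {max_spurious:.2f}")
```

With n fixed, the largest spurious correlation climbs toward 1 as p grows, which is why correlation-screening thresholds must be calibrated to the dimension, not just the sample size.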
The Hoffmann-Jorgensen inequality in metric semigroups
We prove a refinement of the inequality by Hoffmann-Jorgensen that is
significant for three reasons. First, our result improves on the
state-of-the-art even for real-valued random variables. Second, the result
unifies several versions in the Banach space literature, including those by
Johnson and Schechtman [Ann. Probab. 17 (1989)], Klass and Nowicki [Ann.
Probab. 28 (2000)], and Hitczenko and Montgomery-Smith [Ann. Probab. 29
(2001)]. Finally, we show that the Hoffmann-Jorgensen inequality (including our
generalized version) holds not only in Banach spaces but, more generally, in a
very primitive mathematical framework required to state the inequality: a
metric semigroup. This includes normed linear spaces as well as
all compact, discrete, or (connected) abelian Lie groups.
Comment: 11 pages, published in the Annals of Probability. The Introduction
section shares motivating examples with arXiv:1506.0260
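For orientation, one common formulation of the classical Hoffmann-Jorgensen inequality in the real-valued or Banach-space setting (notation ours, not the paper's): for independent random variables $X_1, \dots, X_n$ with partial sums $S_k = X_1 + \cdots + X_k$, and all $s, t > 0$,

```latex
\[
\mathbb{P}\Bigl(\max_{k \le n} \|S_k\| > 3t + s\Bigr)
\;\le\;
\Bigl(\mathbb{P}\bigl(\max_{k \le n} \|S_k\| > t\bigr)\Bigr)^{2}
\;+\;
\mathbb{P}\Bigl(\max_{j \le n} \|X_j\| > s\Bigr).
\]
```

Refinements of this kind sharpen the constants and exponents in such tail bounds; the metric-semigroup version replaces the norm by a suitable translation-invariant metric.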
Retaining positive definiteness in thresholded matrices
Positive definite (p.d.) matrices arise naturally in many areas within
mathematics and also feature extensively in scientific applications. In modern
high-dimensional applications, a common approach to finding sparse positive
definite matrices is to threshold their small off-diagonal elements. This
thresholding, sometimes referred to as hard-thresholding, sets small elements
to zero. Thresholding has the attractive property that the resulting matrices
are sparse, and are thus easier to interpret and work with. In many
applications, it is often required, and thus implicitly assumed, that
thresholded matrices retain positive definiteness. In this paper we formally
investigate the algebraic properties of p.d. matrices which are thresholded. We
demonstrate that for positive definiteness to be preserved, the pattern of
elements to be set to zero has to necessarily correspond to a graph which is a
union of disconnected complete components. This result rigorously demonstrates
that, except in special cases, positive definiteness can be easily lost. We
then proceed to demonstrate that the class of diagonally dominant matrices is
not maximal in terms of retaining positive definiteness when thresholded.
Consequently, we derive characterizations of matrices which retain positive
definiteness when thresholded with respect to important classes of graphs. In
particular, we demonstrate that retaining positive definiteness upon
thresholding is governed by complex algebraic conditions.
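A small numerical illustration (ours, not from the paper) of the main phenomenon: hard-thresholding a p.d. matrix destroys positive definiteness when the surviving sparsity pattern is a path graph, but preserves it when the pattern splits into disconnected complete (block-diagonal) components.

```python
import numpy as np

def hard_threshold(A, eps):
    """Set off-diagonal entries with |a_ij| <= eps to zero, keep the diagonal."""
    T = np.where(np.abs(A) > eps, A, 0.0)
    np.fill_diagonal(T, np.diag(A))
    return T

# A p.d. matrix whose surviving pattern after thresholding is a path graph.
A = np.array([[1.0, 0.8, 0.3],
              [0.8, 1.0, 0.8],
              [0.3, 0.8, 1.0]])
assert np.linalg.eigvalsh(A).min() > 0          # A is positive definite

T = hard_threshold(A, 0.5)                      # zeroes the 0.3 entries
print(np.linalg.eigvalsh(T).min())              # negative: p.d. is lost

# By contrast, when thresholding leaves disconnected complete components
# (here: two diagonal 2x2 blocks), positive definiteness is retained.
B = np.array([[1.0, 0.8, 0.1, 0.1],
              [0.8, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.8],
              [0.1, 0.1, 0.8, 1.0]])
assert np.linalg.eigvalsh(B).min() > 0

TB = hard_threshold(B, 0.5)                     # block-diagonal result
print(np.linalg.eigvalsh(TB).min())             # positive: p.d. retained
```

The path-pattern example shows how little it takes to lose positive definiteness; the block example is the "union of disconnected complete components" pattern that the paper shows is the only universally safe one.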
Integration and measures on the space of countable labelled graphs
In this paper we develop a rigorous foundation for the study of integration
and measures on the space of all graphs defined on a countable labelled vertex
set. We first study several interrelated σ-algebras and a large family of
probability measures on graph space. We then focus on a "dyadic" Hamming
distance function, which was very useful in the study of differentiation on
graph space. This function is shown to be a Haar measure-preserving bijection
from the subset of infinite graphs to the circle (equipped with the
Haar/Lebesgue measure), thereby naturally identifying the two spaces. As a
consequence, we establish a "change of variables" formula that enables the
transfer of the Riemann-Lebesgue theory on the circle to graph space. This also
complements previous work in which a theory of Newton-Leibniz differentiation
was transferred from the real line to graph space over a countable vertex set.
Finally, we identify the Pontryagin dual of graph space, and characterize the
positive definite functions on it.
Comment: 15 pages, LaTeX
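One natural way to realize such a dyadic identification (our sketch of the idea, not necessarily the paper's exact construction): enumerate the possible edges $e_1, e_2, \dots$ over the countable vertex set and read a graph $G$ off as a binary expansion,

```latex
\[
w(G) \;=\; \sum_{k=1}^{\infty} \frac{\mathbf{1}_{\{e_k \in G\}}}{2^{k}} \;\in\; [0,1],
\]
```

so that graph space is identified with the sequence space $\{0,1\}^{\mathbb{N}}$ and, away from the countably many dyadic rationals, with the circle; the product Bernoulli(1/2) measure on graphs then corresponds to Lebesgue/Haar measure.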
The Khinchin-Kahane and Levy inequalities for abelian metric groups, and transfer from normed (abelian semi)groups to Banach spaces
The Khinchin-Kahane inequality is a fundamental result in the probability
literature, with the most general version to date holding in Banach spaces.
Motivated by modern settings and applications, we generalize this inequality to
arbitrary metric groups which are abelian.
If instead of abelian one assumes the group's metric to be a norm (i.e.,
homogeneous), then we explain how the inequality improves to the same one as in
Banach spaces. This occurs via a "transfer principle" that helps carry over
questions involving normed metric groups and abelian normed semigroups into the
Banach space framework. This principle also extends the notion of the
expectation to random variables with values in arbitrary abelian normed metric
semigroups. We provide additional applications, including the study of weakly
semigroup-valued sequences and related Rademacher series.
On a related note, we also formulate a "general" Levy inequality, with two
features: (i) It subsumes several known variants in the Banach space
literature; and (ii) We show the inequality in the minimal framework required
to state it: abelian metric groups.
Comment: 15 pages, Introduction section shares motivating examples with
arXiv:1506.02605. Significant revisions to the exposition. Final version, to
appear in Journal of Mathematical Analysis and Applications
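For context, the classical Khinchin-Kahane inequality in a Banach space (one standard formulation; notation ours, not the paper's): for $1 \le p < q < \infty$, vectors $x_1, \dots, x_n$, and i.i.d. Rademacher signs $\varepsilon_1, \dots, \varepsilon_n$, there is a constant $C_{p,q}$ depending only on $p$ and $q$ such that

```latex
\[
\Bigl(\mathbb{E}\,\Bigl\|\sum_{j=1}^{n} \varepsilon_j x_j\Bigr\|^{q}\Bigr)^{1/q}
\;\le\;
C_{p,q}\,
\Bigl(\mathbb{E}\,\Bigl\|\sum_{j=1}^{n} \varepsilon_j x_j\Bigr\|^{p}\Bigr)^{1/p}.
\]
```

The point of the paper is that a version of this comparison of moments survives when the Banach space is replaced by an abelian metric group, with the Banach-space form recovered when the metric is a norm.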
A Methodology for Robust Multiproxy Paleoclimate Reconstructions and Modeling of Temperature Conditional Quantiles
Great strides have been made in the field of reconstructing past temperatures
based on models relating temperature to temperature-sensitive paleoclimate
proxies. One of the goals of such reconstructions is to assess whether the
current climate is anomalous in a millennial context. These regression-based
approaches model the conditional mean of the temperature distribution as a
function of paleoclimate proxies (or vice versa). Much recent work in the area
has considered methods that help reduce the uncertainty inherent in such
statistical paleoclimate reconstructions, with the ultimate goal of improving
the confidence that can be attached to such endeavors. A second important
scientific focus is forward models for proxies, whose goal is to understand how
paleoclimate proxies are driven by temperature and other environmental
variables. In this paper we introduce novel
statistical methodology for (1) quantile regression with autoregressive
residual structure, (2) estimation of corresponding model parameters, (3)
development of a rigorous framework for specifying uncertainty estimates of
quantities of interest, yielding (4) statistical byproducts that address the
two scientific foci discussed above. Our statistical methodology demonstrably
produces a more robust reconstruction than is possible by using
conditional-mean-fitting methods. Our reconstruction shares some of the common
features of past reconstructions, but also yields useful insights. More
importantly, we are able to demonstrate a significantly smaller uncertainty
than that from previous regression methods. In addition, the quantile
regression component allows us to model, in a more complete and flexible way
than least squares, the conditional distribution of temperature given proxies.
This relationship can be used to inform forward models of how proxies are
driven by temperature.
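A minimal sketch (ours, not the paper's methodology) of quantile regression via the pinball (check) loss; the simulated data use AR(1) residuals to mimic the autoregressive error structure the paper models, though here the AR structure appears only in the simulation, not in the estimator. Variable names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def pinball_loss(params, x, y, tau):
    """Check loss for the linear conditional-quantile model y ~ a + b*x."""
    a, b = params
    u = y - (a + b * x)
    return np.sum(u * (tau - (u < 0)))  # rho_tau(u) = u * (tau - 1{u < 0})

# Simulated "proxy" x and "temperature" y with AR(1) noise.
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = 0.5 * eps[t - 1] + rng.normal(scale=0.5)
y = 1.0 + 2.0 * x + eps

# Fit the conditional median (tau = 0.5); sweeping tau over (0, 1)
# traces out the conditional distribution of temperature given the proxy.
fit = minimize(pinball_loss, x0=np.zeros(2), args=(x, y, 0.5),
               method="Nelder-Mead")
a_hat, b_hat = fit.x
print(f"intercept ~ {a_hat:.2f}, slope ~ {b_hat:.2f}")
```

Fitting several quantile levels, rather than a single conditional mean, is what gives the approach its robustness to non-Gaussian errors and its ability to report conditional quantiles directly.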